── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ purrr::compose() masks pryr::compose()
✖ lubridate::duration() masks arrow::duration()
✖ tidyr::extract() masks R.utils::extract()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
✖ purrr::partial() masks pryr::partial()
✖ dplyr::where() masks pryr::where(), gtsummary::where()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(data.table)
Attaching package: 'data.table'
The following objects are masked from 'package:lubridate':
hour, isoweek, mday, minute, month, quarter, second, wday, week,
yday, year
The following objects are masked from 'package:dplyr':
between, first, last
The following object is masked from 'package:purrr':
transpose
The following object is masked from 'package:pryr':
address
In this exercise, we use tidyverse (ggplot2, dplyr, etc) to explore the MIMIC-IV data introduced in homework 1 and to build a cohort of ICU stays.
Q1. Visualizing patient trajectory
Visualizing a patient’s encounters in a health care system is a common task in clinical data analysis. In this question, we will visualize a patient’s ADT (admission-discharge-transfer) history and ICU vitals in the MIMIC-IV data.
Q1.1 ADT history
A patient’s ADT history records the time of admission, discharge, and transfer in the hospital. This figure shows the ADT history of the patient with subject_id 10001217 in the MIMIC-IV data. The x-axis is the calendar time, and the y-axis is the type of event (ADT, lab, procedure). The color of the line segment represents the care unit. The size of the line segment represents whether the care unit is an ICU/CCU. The crosses represent lab events, and the shape of the dots represents the type of procedure. The title of the figure shows the patient’s demographic information and the subtitle shows top 3 diagnoses.
Do a similar visualization for the patient with subject_id 10063848 using ggplot.
Hint: We need to pull information from data files patients.csv.gz, admissions.csv.gz, transfers.csv.gz, labevents.csv.gz, procedures_icd.csv.gz, diagnoses_icd.csv.gz, d_icd_procedures.csv.gz, and d_icd_diagnoses.csv.gz. For the big file labevents.csv.gz, use the Parquet format you generated in Homework 2. For reproducibility, make the Parquet folder labevents_pq available at the current working directory hw3, for example, by a symbolic link. Make your code reproducible.
Solution:
# Load datasets as tblespatients <-read_csv("~/mimic/hosp/patients.csv.gz")
Rows: 364627 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): gender, anchor_year_group
dbl (3): subject_id, anchor_age, anchor_year
date (1): dod
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 546028 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): admission_type, admit_provider_id, admission_location, discharge_l...
dbl (3): subject_id, hadm_id, hospital_expire_flag
dttm (5): admittime, dischtime, deathtime, edregtime, edouttime
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 2413581 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): eventtype, careunit
dbl (3): subject_id, hadm_id, transfer_id
dttm (2): intime, outtime
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 859655 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): icd_code
dbl (4): subject_id, hadm_id, seq_num, icd_version
date (1): chartdate
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 6364488 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): icd_code
dbl (4): subject_id, hadm_id, seq_num, icd_version
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 86423 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): icd_code, long_title
dbl (1): icd_version
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 112107 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): icd_code, long_title
dbl (1): icd_version
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# labeventsdbExecute(con, "CREATE TABLE Lab AS SELECT subject_id,hadm_id,labevent_id, charttime, FROM read_parquet( '~/biostat-203b-2025-winter/hw3/labevents_parquet/' || 'part-0.parquet' ) WHERE subject_id = 10001217")
[1] 353
# ADT dbExecute(con, " CREATE TABLE ADT AS SELECT subject_id,hadm_id, careunit,eventtype, intime,outtime, FROM transfers")
[1] 12
# ProceduresdbExecute(con, " CREATE TABLE Procedure AS SELECT p.subject_id,p.hadm_id,chartdate, dp.icd_code,dp.icd_version,dp.long_title, FROM procedures p LEFT JOIN d_icd_procedures dp ON p.icd_code = dp.icd_code and p.icd_version = dp.icd_version")
[1] 4
#ADT historydbExecute(con, " CREATE TABLE ADT_history AS SELECT a.subject_id,a.hadm_id,a.careunit, a.eventtype,a.intime,a.outtime, p.chartdate,p.long_title, p.chartdate, FROM ADT a LEFT JOIN Procedure p ON a.subject_id = p.subject_id and a.hadm_id = p.hadm_id")
[1] 24
# Collect ADT Dataadt_data <-tbl(con, "ADT_history") %>%collect() %>%mutate(intime =as.POSIXct(intime, format="%Y-%m-%d %H:%M:%S"),outtime =as.POSIXct(outtime, format="%Y-%m-%d %H:%M:%S")) %>%filter(careunit !="UNKNOWN", outtime > intime) # Collect Procedure Dataprocedure_data <-tbl(con, "Procedure") %>%collect() %>%mutate(chartdate =as.POSIXct(chartdate, format="%Y-%m-%d"))# Collect Lab Data (Ensure One Cross Per Day)lab_data <-tbl(con, "Lab") %>%collect() %>%mutate(charttime =as.POSIXct(charttime, format="%Y-%m-%d %H:%M:%S"))# Create Final Plotfinal_plot <-ggplot() +# Procedure Eventsgeom_point(data = procedure_data, aes(x = chartdate, y ="Procedure", shape = long_title), size =5) +# Lab Events (Cross Shape)geom_point(data = lab_data, aes(x = charttime, y ="Lab"), shape =3, size =4, stroke =1.5) +# ADT timelinegeom_segment(data = adt_data, aes(x = intime, xend = outtime, y ="ADT", yend ="ADT", color = careunit, size =ifelse(careunit %in%c("Surgical Intensive Care Unit (SICU)", "Medical Intensive Care Unit (MICU)"), 10, 5))) +# Manual Adjustmentsscale_shape_manual(values =c(16, 17, 15)) +# Different Shapes for Proceduresscale_size_identity() +scale_x_datetime(date_labels ="%b %d", date_breaks ="1 week") +# Axis Labels and Themeslabs(title =paste("Patient", "and Procedure History"),x ="Calendar Time",y ="",color ="Care Unit",shape ="Procedure") +theme_minimal() +theme(legend.position ="bottom", # Move legend to bottomaxis.text.y =element_text(size =10), # Ensure Y-axis labels are readableaxis.ticks.y =element_blank()) +guides(color =guide_legend(nrow =2), shape =guide_legend(nrow =2)) # Split legend into two rows
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
Warning in rm(admissions_data, lab_data, final_plot, transfers_data,
procedure_data, : object 'procedure_plot' not found
Warning in rm(admissions_data, lab_data, final_plot, transfers_data,
procedure_data, : object 'adt_plot' not found
rm(con)
Q1.2 ICU stays
ICU stays are a subset of ADT history. This figure shows the vitals of the patient 10001217 during ICU stays. The x-axis is the calendar time, and the y-axis is the value of the vital. The color of the line represents the type of vital. The facet grid shows the abbreviation of the vital and the stay ID.
Do a similar visualization for the patient 10063848.
Rows: 94458 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): first_careunit, last_careunit
dbl (4): subject_id, hadm_id, stay_id, los
dttm (2): intime, outtime
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Q2.2 Summary and visualization
How many unique subject_id? Can a subject_id have multiple ICU stays? Summarize the number of ICU stays per subject_id by graphs.
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:arrow':
schema
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
# Count ICU stays per subjecticu_stays_per_subject <- icustays_tble %>%group_by(subject_id) %>%summarise(num_stays =n(), .groups ="drop")# Create a histogram to visualize the distributionp <-ggplot(icu_stays_per_subject, aes(x = num_stays)) +geom_histogram(binwidth =1, fill ="#1B9E76", color ="black", alpha =0.7) +# Use histogram with bin width 1scale_x_continuous(limits =c(0, 30)) +# Limit X-axis to focus on the majority of datalabs(title ="Distribution of ICU Stays per Subject",x ="Number of ICU Stays",y ="Count of Subjects") +theme_minimal() # Use a clean theme for better readability# Convert to an interactive plotggplotly(p)
Warning: Removed 4 rows containing non-finite outside the scale range
(`stat_bin()`).
rm(icu_stays_per_subject,num_unique_subjects)
The histogram shows the distribution of ICU stays per subject (patient). Most subjects have only one ICU stay, with a sharp drop-off as the number of stays increases. The majority of patients (49,124 subjects) had a single ICU stay, while a much smaller proportion experienced multiple ICU stays. This suggests that repeated ICU admissions are uncommon, but some patients do return multiple times, possibly due to chronic conditions, post-surgical complications, or severe illnesses requiring multiple admissions. The long tail of the distribution indicates that a few patients had 10 or more ICU stays, though these cases are rare.
Rows: 546028 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): admission_type, admit_provider_id, admission_location, discharge_l...
dbl (3): subject_id, hadm_id, hospital_expire_flag
dttm (5): admittime, dischtime, deathtime, edregtime, edouttime
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Q3.2 Summary and visualization
Summarize the following information by graphics and explain any patterns you see.
number of admissions per patient
admission hour (anything unusual?)
admission minute (anything unusual?)
length of hospital stay (from admission to discharge) (anything unusual?)
All dates in the database have been shifted to protect patient confidentiality. Dates will be internally consistent for the same patient, but randomly distributed in the future. Dates of birth which occur in the present time are not true dates of birth. Furthermore, dates of birth which occur before the year 1900 occur if the patient is older than 89. In these cases, the patient’s age at their first admission has been fixed to 300.
Solution:
number of admissions per patient
# Compute the number of admissions per patientadmissions_per_patient <- admissions_tble %>%group_by(subject_id) %>%summarise(num_admissions =n(), .groups ="drop")# Create a histogram to visualize the distributionp <-ggplot(admissions_per_patient, aes(x = num_admissions)) +geom_histogram(binwidth =1, fill ="#D81B60", color ="black", alpha =0.8) +# Use histogram instead of bar chartscale_x_continuous(limits =c(0, 30)) +# Limit the X-axis to focus on the most relevant rangelabs(title ="Number of Admissions per Patient",x ="Admissions per Patient",y ="Count of Patients") +theme_minimal() # Use a clean theme for better readability# Convert to an interactive plotggplotly(p)
Warning: Removed 504 rows containing non-finite outside the scale range
(`stat_bin()`).
rm(admissions_per_patient)
Most patients have only one or very few admissions, while very few patients have a large number of admissions. A small number of patients have multiple hospitalizations, but the frequency drops quickly as the number of admissions increases. Patients with 10+ admissions are very rare. The distribution is heavily right-skewed, with very few patients experiencing more than 5 hospital stays. This means that most patients have a low number of admissions, and a small subset of patients has a significantly higher number. Also, Most hospital patients are one-time visitors, while a minority (likely with chronic illnesses or complications) are readmitted multiple times.
admission hour (anything unusual?)
# Extract hour from admission timeadmissions_tble <- admissions_tble %>%mutate(admission_hour =hour(admittime))# Create a histogram of admission hoursp_hour <-ggplot(admissions_tble, aes(x = admission_hour)) +geom_histogram(binwidth =1, fill ="#1E88E5", color ="black", alpha =0.8) +scale_x_continuous(breaks =seq(0, 23, by =1)) +labs(title ="Distribution of Admission Hours",x ="Hour of Admission",y ="Count of Admissions") +theme_minimal()# Convert to interactive plotggplotly(p_hour)
The distribution of admission hours shows notable patterns. There is a sharp peak at midnight (00:00), suggesting that many admissions are recorded at the start of the day, possibly due to administrative rounding or batch processing of records. Another smaller peak is observed around 07:00–08:00, which could be associated with morning shift changes or scheduled hospital admissions. Admissions increase steadily from late morning and peak between 16:00–19:00, likely reflecting patient arrivals after daytime medical consultations and emergency cases during evening hours. The pattern indicates a combination of systemic recording practices, scheduled admissions, and emergency patient arrivals.
admission minute (anything unusual?)
# Extract minute from admission timeadmissions_tble <- admissions_tble %>%mutate(admission_minute =minute(admittime))# Create a histogram of admission minutesp_minute <-ggplot(admissions_tble, aes(x = admission_minute)) +geom_histogram(binwidth =1, fill ="#FFC107", color ="black", alpha =0.8) +scale_x_continuous(breaks =seq(0, 59, by =5)) +labs(title ="Distribution of Admission Minutes",x ="Minute of Admission",y ="Count of Admissions") +theme_minimal()# Convert to interactive plotggplotly(p_minute)
The distribution of admission minutes shows distinct spikes at 0, 15, 30, and 45 minutes, suggesting that admission times are often rounded to quarter-hour intervals. This pattern likely arises due to manual data entry practices, hospital system constraints, or administrative processes that batch-record admissions. The relatively uniform distribution in other minutes indicates that some admissions occur naturally without rounding, but the spikes highlight a systemic bias in how times are logged. This could impact time-based analyses and should be accounted for in further data processing.
length of hospital stay (from admission to discharge) (anything unusual?)
# Calculate length of stay in daysadmissions_tble <- admissions_tble %>%mutate(length_of_stay =as.numeric(difftime(dischtime, admittime, units ="days")))# Create a histogram of hospital stay lengthp_los <-ggplot(admissions_tble, aes(x = length_of_stay)) +geom_histogram(binwidth =1, fill ="#4CAF50", color ="black", alpha =0.8) +scale_x_continuous(limits =c(0, 50)) +# Adjusted limit for visualizationlabs(title ="Distribution of Hospital Stay Length",x ="Length of Stay (Days)",y ="Count of Admissions") +theme_minimal()# Convert to interactive plotggplotly(p_los)
Warning: Removed 2105 rows containing non-finite outside the scale range
(`stat_bin()`).
The distribution of hospital stay length is right-skewed, with most admissions having a short length of stay. The majority of patients are discharged within a few days, with the highest count around 1–3 days, suggesting that many hospital visits are for short-term treatments or observations. As the length of stay increases, the number of admissions decreases, indicating that extended hospitalizations are less common. The presence of a long tail suggests that a small number of patients require prolonged hospital care, possibly due to severe or chronic conditions. This pattern is typical in hospital data, where most cases are routine and resolved quickly, while complex cases require extended stays.
Rows: 364627 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): gender, anchor_year_group
dbl (3): subject_id, anchor_age, anchor_year
date (1): dod
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Q4.2 Summary and visualization
Summarize variables gender and anchor_age by graphics, and explain any patterns you see.
Solution:
# Count the number of patients by gendergender_dist <- patients_tble %>%count(gender)# Bar plot for gender distributionp_gender <-ggplot(gender_dist, aes(x = gender, y = n, fill = gender)) +geom_bar(stat ="identity", color ="black", alpha =0.8) +labs(title ="Gender Distribution",x ="Gender",y ="Count of Patients") +theme_minimal()# Convert to interactive plotggplotly(p_gender)
rm(gender_dist)
The bar chart indicates that there are more female (F) patients than male (M) patients in the dataset. The difference is noticeable but not extreme, suggesting a slight gender imbalance in hospital admissions. This could be influenced by factors such as longer life expectancy in females, higher healthcare utilization rates among women, or specific hospital demographics.
# Histogram of anchor_agep_age <-ggplot(patients_tble, aes(x = anchor_age)) +geom_histogram(binwidth =5, fill ="#1E88E5", color ="black", alpha =0.8) +labs(title ="Distribution of Anchor Age",x ="Age (Years)",y ="Count of Patients") +theme_minimal()# Convert to interactive plotggplotly(p_age)
The histogram of anchor_age shows a bimodal distribution, with a large concentration of patients in the 20–30 age range and another broader peak between 50–70 years old. The younger peak could represent a large number of maternity-related or younger adult admissions, while the older peak likely reflects age-related chronic conditions and elderly care. The drop-off after 80 years suggests a lower number of very elderly patients, possibly due to lower life expectancy or selection bias in hospital records.
We are interested in the lab measurements of creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931). Retrieve a subset of labevents.csv.gz that only containing these items for the patients in icustays_tble. Further restrict to the last available measurement (by storetime) before the ICU stay. The final labevents_tble should have one row per ICU stay and columns for each lab measurement.
Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make labevents_pq folder available at the current working directory hw3, for example, by a symbolic link.
Solution:
# Connect to DuckDBcon <-dbConnect(duckdb::duckdb(), dbdir =":memory:")# Load and filter lab events directly in DuckDBdbExecute(con, "CREATE TABLE filtered_labevents AS SELECT subject_id, itemid, storetime, valuenum FROM read_parquet( '~/biostat-203b-2025-winter/hw3/labevents_parquet/' || 'part-0.parquet' ) WHERE itemid IN (50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931)")
[1] 32679896
# Load ICU stays dataset from CSV and store it in DuckDBdbExecute(con, "CREATE TABLE icustays AS SELECT subject_id, stay_id, intime, outtime FROM read_csv_auto('~/mimic/icu/icustays.csv.gz')")
[1] 94458
# Join lab events with ICU stays and filter by time conditiondbExecute(con, "CREATE TABLE merged_labevents AS SELECT l.subject_id, i.stay_id, l.itemid, l.storetime, l.valuenum FROM filtered_labevents l INNER JOIN icustays i ON l.subject_id = i.subject_id WHERE l.storetime < i.intime")
[1] 20122551
# Select the most recent lab measurement per `itemid` before ICU admissiondbExecute(con, "CREATE TABLE latest_labs AS SELECT subject_id, stay_id, itemid, storetime, valuenum FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY subject_id, stay_id, itemid ORDER BY storetime DESC) AS rn FROM merged_labevents ) WHERE rn = 1")
[1] 677237
# Pivot `itemid` to become separate columnsdbExecute(con, "CREATE TABLE labevents_pivoted AS SELECT subject_id, stay_id, MAX(CASE WHEN itemid = 50882 THEN valuenum END) AS bicarbonate, MAX(CASE WHEN itemid = 50902 THEN valuenum END) AS chloride, MAX(CASE WHEN itemid = 50912 THEN valuenum END) AS creatinine, MAX(CASE WHEN itemid = 50931 THEN valuenum END) AS glucose, MAX(CASE WHEN itemid = 50971 THEN valuenum END) AS potassium, MAX(CASE WHEN itemid = 50983 THEN valuenum END) AS sodium, MAX(CASE WHEN itemid = 51221 THEN valuenum END) AS hematocrit, MAX(CASE WHEN itemid = 51301 THEN valuenum END) AS wbc FROM latest_labs GROUP BY subject_id, stay_id")
[1] 88086
# Fetch the final processed table into Rlabevents_tble <-dbGetQuery(con, "SELECT * FROM labevents_pivoted")# Arrange for better readabilitylabevents_tble <- labevents_tble %>%arrange(subject_id, stay_id)# View final datasetoptions(width =1000)print(head(labevents_tble, 10))
# Disconnect from DuckDBdbDisconnect(con, shutdown =TRUE)rm(con)
Q6. Vitals from charted events
chartevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/chartevents/) contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are
We are interested in the vitals for ICU patients: heart rate (220045), systolic non-invasive blood pressure (220179), diastolic non-invasive blood pressure (220180), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz only containing these items for the patients in icustays_tble. Further restrict to the first vital measurement within the ICU stay. The final chartevents_tble should have one row per ICU stay and columns for each vital measurement.
Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make chartevents_pq folder available at the current working directory, for example, by a symbolic link.
Solution:
#Connect to DuckDBcon <-dbConnect(duckdb::duckdb(), dbdir =":memory:")# Load ICU stays dataset from CSV into DuckDBduckdb_table_icustays <-tbl( con, sql("SELECT * FROM read_csv_auto('~/mimic/icu/icustays.csv.gz')") )# Query ICU stays dataset within DuckDBresult_icustays <- duckdb_table_icustays %>%select(subject_id, stay_id, intime,outtime) %>%# Select necessary columnscollect() # Bring the filtered results into memory as a tibble
# Connect to DuckDBcon <-dbConnect(duckdb::duckdb(), dbdir =":memory:") # Load and filter only required `itemid` values (Reduce Memory Usage)dbExecute(con, "CREATE TABLE chartevents_duckdb AS SELECT subject_id, itemid, storetime, valuenum FROM read_parquet( '~/biostat-203b-2025-winter/hw3/chartevents_parquet/' || 'part-0.parquet' ) WHERE itemid IN (220045, 220179, 220180, 223761, 220210)")
[1] 30200193
# Load `icustays_tble` into DuckDBdbWriteTable(con, "icustays", result_icustays, overwrite =TRUE)
# Filter the measurements in between the icu timedbExecute(con, "CREATE TABLE latest_vitals_raw AS SELECT c.subject_id, i.stay_id, c.itemid, c.storetime, c.valuenum FROM chartevents_duckdb c INNER JOIN icustays i ON c.subject_id = i.subject_id WHERE c.storetime BETWEEN i.intime AND i.outtime")
[1] 30129155
rm(result_icustays)
# Select the first measurement (`storetime`) per `itemid` and `stay_id`dbExecute(con, "CREATE TABLE latest_vitals_filtered AS SELECT subject_id, stay_id, itemid, storetime, valuenum FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY subject_id, stay_id, itemid ORDER BY storetime ASC) AS rn FROM latest_vitals_raw ) WHERE rn = 1")
[1] 467599
# Pivot `itemid` into separate columns (one row per ICU stay)latest_vitals <-dbGetQuery(con, "SELECT * FROM latest_vitals_filtered")chartevents_tble <- latest_vitals %>%select(-storetime) %>%pivot_wider(names_from = itemid, values_from = valuenum, values_fill =list(valuenum =NA))# Create a named vector mapping `itemid` to readable namesitemid_mapping <-c("220045"="heart_rate","220179"="non-invasive_blood_pressure_systolic","220180"="non-invasive_blood_pressure_diastolic","223761"="temperature_fahrenheit","220210"="respiratory_rate")# Rename columns based on `itemid_mapping`colnames(chartevents_tble) <-c("subject_id", "stay_id", itemid_mapping[colnames(chartevents_tble)[-c(1:2)]]) rm(latest_vitals)
# Arrange for better readabilitychartevents_tble <- chartevents_tble %>%arrange(subject_id, stay_id)# Ensure columns appear in the correct orderdesired_order <-c("subject_id", "stay_id", "heart_rate", "non-invasive_blood_pressure_diastolic", "non-invasive_blood_pressure_systolic","respiratory_rate","temperature_fahrenheit")chartevents_tble <- chartevents_tble[, desired_order]# View the final datasetoptions(width =1000)print(head(chartevents_tble, 10))
# disconnect from DuckDBdbDisconnect(con, shutdown =TRUE)rm(con)
Q7. Putting things together
Let us create a tibble mimic_icu_cohort for all ICU stays, where rows are all ICU stays of adults (age at intime >= 18) and columns contain at least following variables
all variables in icustays_tble
all variables in admissions_tble
all variables in patients_tble
the last lab measurements before the ICU stay in labevents_tble
the first vital measurements during the ICU stay in chartevents_tble
The final mimic_icu_cohort should have one row per ICU stay and columns for each variable.
Solution:
# Connect to DuckDBcon <-dbConnect(duckdb::duckdb(), dbdir =":memory:")# Write data into DuckDBdbWriteTable(con, "icustays_tble", icustays_tble, overwrite =TRUE)dbWriteTable(con, "admissions_tble", admissions_tble, overwrite =TRUE)dbWriteTable(con, "patients_tble", patients_tble, overwrite =TRUE)# Load ICU stays dataicustays_tble <-dbGetQuery(con, "SELECT * FROM icustays_tble")# Load admissions dataadmissions_tble <-dbGetQuery(con, "SELECT * FROM admissions_tble")# Load patients datapatients_tble <-dbGetQuery(con, "SELECT * FROM patients_tble")
# Write data into DuckDBdbWriteTable(con, "labevents_tble", labevents_tble, overwrite =TRUE)dbWriteTable(con, "chartevents_tble", chartevents_tble, overwrite =TRUE)# Load lab/chart eventslabevents_tble <-dbGetQuery(con, "SELECT * FROM labevents_tble")chartevents_tble <-dbGetQuery(con, "SELECT * FROM chartevents_tble")
# Merge ICU stays and admissions**mimic_icu_cohort <- icustays_tble %>%inner_join(admissions_tble, by ="subject_id") %>%inner_join(patients_tble, by ="subject_id")
Warning in inner_join(., admissions_tble, by = "subject_id"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 48 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship = "many-to-many"` to silence this warning.
# Add chart events (first recorded vital signs)**mimic_icu_cohort <- mimic_icu_cohort %>%left_join(chartevents_tble, by =c("subject_id", "stay_id"))#*Ensure each ICU stay corresponds to only one row**mimic_icu_cohort <- mimic_icu_cohort %>%distinct(subject_id, stay_id, .keep_all =TRUE)# Sort by subject_id and stay_id**mimic_icu_cohort <- mimic_icu_cohort %>%arrange(subject_id, stay_id)# Display the final datasetoptions(width =1000)print(head(mimic_icu_cohort, 10))
subject_id hadm_id.x stay_id first_careunit last_careunit intime outtime los hadm_id.y admittime dischtime deathtime admission_type admit_provider_id admission_location discharge_location insurance language marital_status race edregtime edouttime hospital_expire_flag admission_hour admission_minute length_of_stay gender anchor_age anchor_year anchor_year_group dod bicarbonate chloride creatinine glucose potassium sodium hematocrit wbc heart_rate non-invasive_blood_pressure_diastolic non-invasive_blood_pressure_systolic respiratory_rate temperature_fahrenheit
1 10000032 29079034 39553978 Medical Intensive Care Unit (MICU) Medical Intensive Care Unit (MICU) 2180-07-23 14:00:00 2180-07-23 23:50:47 0.4102662 22595853 2180-05-06 22:23:00 2180-05-07 17:15:00 <NA> URGENT P49AFC TRANSFER FROM HOSPITAL HOME Medicaid English WIDOWED WHITE 2180-05-06 19:17:00 2180-05-06 23:30:00 0 22 23 0.7861111 F 52 2180 2014 - 2016 2180-09-09 25 95 0.7 102 6.7 126 41.1 6.9 91 48 84 24 98.7
2 10000690 25860671 37081114 Medical Intensive Care Unit (MICU) Medical Intensive Care Unit (MICU) 2150-11-02 19:37:00 2150-11-06 17:03:17 3.8932523 23280645 2150-09-16 19:48:00 2150-09-24 13:50:00 <NA> EW EMER. P941QM EMERGENCY ROOM SKILLED NURSING FACILITY Medicare English WIDOWED WHITE 2150-09-16 16:00:00 2150-09-16 21:03:00 0 19 48 7.7513889 F 86 2150 2008 - 2010 2152-01-30 26 100 1.0 85 4.8 137 36.1 7.1 79 63 107 23 97.7
3 10000980 26913865 39765666 Medical Intensive Care Unit (MICU) Medical Intensive Care Unit (MICU) 2189-06-27 08:42:00 2189-06-27 20:38:27 0.4975347 20897796 2193-08-15 01:01:00 2193-08-17 15:07:00 <NA> OBSERVATION ADMIT P55EL5 WALK-IN/SELF REFERRAL HOME HEALTH CARE Medicare English MARRIED BLACK/AFRICAN AMERICAN 2193-08-14 21:25:00 2193-08-15 02:22:00 0 1 1 2.5875000 F 73 2186 2008 - 2010 2193-08-26 21 109 2.3 89 3.9 144 27.3 5.3 77 77 150 23 98.0
4 10001217 27703517 34592300 Surgical Intensive Care Unit (SICU) Surgical Intensive Care Unit (SICU) 2157-12-19 15:42:24 2157-12-20 14:27:41 0.9481134 24597018 2157-11-18 22:56:00 2157-11-25 18:00:00 <NA> EW EMER. P3610N EMERGENCY ROOM HOME HEALTH CARE Private Other MARRIED WHITE 2157-11-18 17:38:00 2157-11-19 01:24:00 0 22 56 6.7944444 F 55 2157 2011 - 2013 <NA> 30 104 0.5 87 4.1 142 37.4 5.4 96 95 167 11 97.6
5 10001217 24597018 37067082 Surgical Intensive Care Unit (SICU) Surgical Intensive Care Unit (SICU) 2157-11-20 19:18:02 2157-11-21 22:08:00 1.1180324 24597018 2157-11-18 22:56:00 2157-11-25 18:00:00 <NA> EW EMER. P3610N EMERGENCY ROOM HOME HEALTH CARE Private Other MARRIED WHITE 2157-11-18 17:38:00 2157-11-19 01:24:00 0 22 56 6.7944444 F 55 2157 2011 - 2013 <NA> 22 108 0.6 112 4.2 142 38.1 15.7 86 90 151 18 98.5
6 10001725 25563031 31205490 Medical/Surgical Intensive Care Unit (MICU/SICU) Medical/Surgical Intensive Care Unit (MICU/SICU) 2110-04-11 15:52:22 2110-04-12 23:59:56 1.3385880 25563031 2110-04-11 15:08:00 2110-04-14 15:00:00 <NA> EW EMER. P32W56 PACU HOME Private English MARRIED WHITE <NA> <NA> 0 15 8 2.9944444 F 46 2110 2011 - 2013 <NA> NA 98 NA NA 4.1 139 NA NA 86 56 73 19 97.7
7 10001843 26133978 39698942 Medical/Surgical Intensive Care Unit (MICU/SICU) Medical/Surgical Intensive Care Unit (MICU/SICU) 2134-12-05 18:50:03 2134-12-06 14:38:26 0.8252662 21728396 2131-11-09 16:05:00 2131-11-11 11:23:00 <NA> OBSERVATION ADMIT P32VJE TRANSFER FROM HOSPITAL HOME HEALTH CARE Medicare English SINGLE WHITE <NA> <NA> 0 16 5 1.8041667 M 73 2131 2017 - 2019 2134-12-06 28 97 1.3 131 3.9 138 31.4 10.4 118 71 112 17 97.9
8 10001884 26184834 37510196 Medical Intensive Care Unit (MICU) Medical Intensive Care Unit (MICU) 2131-01-11 04:20:05 2131-01-20 08:27:30 9.1718171 21192799 2130-10-05 20:04:00 2130-10-06 15:05:00 <NA> EU OBSERVATION P99AKB EMERGENCY ROOM <NA> Medicare English MARRIED BLACK/AFRICAN AMERICAN 2130-10-05 11:58:00 2130-10-06 15:05:00 0 20 4 0.7923611 F 68 2122 2008 - 2010 2131-01-20 30 88 1.1 141 4.5 130 39.7 12.2 38 12 180 16 98.1
9 10002013 23581541 39060235 Cardiac Vascular Intensive Care Unit (CVICU) Cardiac Vascular Intensive Care Unit (CVICU) 2160-05-18 10:00:53 2160-05-19 17:33:33 1.3143519 21516558 2169-12-03 14:12:00 2169-12-05 17:00:00 <NA> OBSERVATION ADMIT P11I4S WALK-IN/SELF REFERRAL HOME HEALTH CARE Medicaid English SINGLE WHITE 2169-12-02 19:30:00 2169-12-03 21:13:00 0 14 12 2.1166667 F 53 2156 2008 - 2010 <NA> 24 102 0.9 288 3.5 137 34.9 7.2 80 70 104 14 97.2
10 10002114 27793700 34672098 Coronary Care Unit (CCU) Coronary Care Unit (CCU) 2162-02-17 23:30:00 2162-02-20 21:16:27 2.9072569 27793700 2162-02-17 22:32:00 2162-03-04 15:16:00 <NA> OBSERVATION ADMIT P46834 PHYSICIAN REFERRAL HOME HEALTH CARE Medicaid English <NA> UNKNOWN 2162-02-17 19:35:00 2162-02-17 23:30:00 0 22 32 14.6972222 M 56 2162 2020 - 2022 2162-12-11 18 NA 3.1 95 6.5 125 34.3 16.8 111 80 112 20 97.9
# Disconnect from the databasedbDisconnect(con, shutdown =TRUE)rm(con)
Q8. Exploratory data analysis (EDA)
Summarize the following information about the ICU stay cohort mimic_icu_cohort using appropriate numerics or graphs:
Length of ICU stay los vs demographic variables (race, insurance, marital_status, gender, age at intime)
Length of ICU stay los vs the last available lab measurements before ICU stay
Length of ICU stay los vs the first vital measurements within the ICU stay